Data availability statement

The dataset for this activity is available in the file XX_NAME_OF_DATA_FILE.csv shared via GitHub (https://github.com/MarieAugerMethe/CANSSI_HMM_wk_dthon_2025). It can be used for our CANSSI/UBC datathon. It cannot be used for any publication without written permission. The data were received from Mapbox Inc. (https://www.mapbox.com/) through their educational partnership program. Please contact Asim Khanal (khanal.asim73@gmail.com) for more information.

Let’s load some of the packages needed.

library(tidyverse)
library(lubridate) # For ymd(); attached by tidyverse 2.0.0+, loaded explicitly for older versions
library(sf) # For spatial data
library(mapview) # To quickly map

Description of the dataset

Mapbox Inc. is a third-party source that provides movement data based on anonymized mobile GPS signals. In Mapbox datasets, visitation is indicated by the “activity index”, which is based on the density of smart devices within a 100 x 100 m grid cell. Mapbox Movement data are aggregations of movement activity for a given time span and geographic area. Thus, a high activity index within a grid cell can result either from a high density of smart devices or from continuous movement of the same smart devices. In addition, the data are normalized for a given country (here, Canada), and each time interval is scaled to a baseline of the mean activity patterns for January of the respective year. The resulting activity index is therefore unitless and has no real-world equivalent, such as the density of people, the number of smart devices, or the time spent in an area.

Here, we selected 13 cells found in a park and assumed that if Mapbox Inc. did not report an activity value for a given day, the activity level was 0.

Let’s read the data into R and have a quick peek at it.

parks_activity <- read.csv("van_13_gps_2022.csv")
# Make cellID a factor
parks_activity$cellID <- as.factor(parks_activity$cellID)
# Make date a date object
parks_activity$date <- ymd(parks_activity$date)
head(parks_activity)
##   cellID       date      lon      lat activity_index_total
## 1      1 2022-01-01 -123.016 49.22432             0.123503
## 2      1 2022-01-02 -123.016 49.22432             0.000000
## 3      1 2022-01-03 -123.016 49.22432             0.000000
## 4      1 2022-01-04 -123.016 49.22432             0.082034
## 5      1 2022-01-05 -123.016 49.22432             0.102824
## 6      1 2022-01-06 -123.016 49.22432             0.105494
##                                    bounds
## 1 -123.01666,49.22388,-123.01529,49.22477
## 2 -123.01666,49.22388,-123.01529,49.22477
## 3 -123.01666,49.22388,-123.01529,49.22477
## 4 -123.01666,49.22388,-123.01529,49.22477
## 5 -123.01666,49.22388,-123.01529,49.22477
## 6 -123.01666,49.22388,-123.01529,49.22477

As you can see, the dataset contains these variables: cellID, date, lon, lat, activity_index_total, and bounds.

Quick visualization

Let’s map the cells that were selected.

# Create spatial object to map quickly
parks_sf <- st_as_sf(parks_activity, coords = c("lon","lat"), remove=FALSE)
st_crs(parks_sf) <- 4326 #Coordinate system: WGS 84

mapview(parks_sf, zcol = "cellID",
        layer.name=" ")
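The bounds column appears to hold the bounding box of each grid cell. Here is a minimal sketch, in base R, that splits it into numeric columns and checks the approximate cell size against the 100 x 100 m described above. The helper names parse_bounds() and cell_size_m() are hypothetical, and the ordering of the four values (west, south, east, north) is an assumption based on the values printed by head(), not something documented in the dataset.

```r
# Hypothetical helper: split "xmin,ymin,xmax,ymax" strings into numeric columns.
# ASSUMPTION: the values are ordered west, south, east, north.
parse_bounds <- function(bounds) {
  parts <- strsplit(as.character(bounds), ",", fixed = TRUE)
  mat <- do.call(rbind, lapply(parts, as.numeric))
  colnames(mat) <- c("xmin", "ymin", "xmax", "ymax")
  as.data.frame(mat)
}

# Hypothetical helper: approximate cell width and height in metres,
# treating the Earth as a sphere (fine for a rough check, not for precise work).
cell_size_m <- function(bb, earth_radius = 6371000) {
  m_per_deg <- 2 * pi * earth_radius / 360
  mid_lat <- (bb$ymin + bb$ymax) / 2
  data.frame(
    width_m  = (bb$xmax - bb$xmin) * m_per_deg * cos(mid_lat * pi / 180),
    height_m = (bb$ymax - bb$ymin) * m_per_deg
  )
}

# Bounds string copied from the first row of the head() output above
bb <- parse_bounds("-123.01666,49.22388,-123.01529,49.22477")
cell_size_m(bb)

# On the full dataset: cell_size_m(parse_bounds(unique(parks_activity$bounds)))
```

If the assumed ordering is right, both sides should come out at roughly 100 m.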

Let’s look at the change in activity level through time.

plot_act <- ggplot(data = parks_activity) +
  geom_line(aes(x = date, y = activity_index_total)) +
  facet_wrap(~ cellID, ncol=3) +
  ylab("Activity levels")
plot_act
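Because days without a reported value were set to 0 (see the description of the dataset), it may help to check how common zeros are in each cell before choosing a model. Here is a small sketch; zero_share() is a hypothetical helper name, and it only assumes the cellID and activity_index_total columns shown above.

```r
# Hypothetical helper: proportion of days with an activity index of exactly 0, per cell
zero_share <- function(activity) {
  out <- aggregate(activity_index_total ~ cellID, data = activity,
                   FUN = function(x) mean(x == 0))
  names(out)[2] <- "prop_zero"
  out
}

# Example call on the data loaded above:
# zero_share(parks_activity)
```

Cells with a large share of zeros may call for an emission distribution that can place mass at 0.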

Datathon goal

The goals are (1) to provide a road map of how to tackle the questions listed below, (2) to attempt to complete at least the first step of your road map, and (3) to provide an interpretation of the results. Make a quick 5-minute presentation explaining what your team did.

Questions:

The general goal is to understand what may be affecting the visitation patterns and to see if we can predict the visitation patterns for subsequent years. Given the dataset we have at hand, we can explore the following questions:

  • Are there weekly or seasonal patterns in the data?
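As a possible first look at this question, here is a sketch that averages the activity index by day of the week and by month. The helper name calendar_profile() is hypothetical, and simple averages are only an exploratory starting point, not the analysis the datathon asks for.

```r
# Hypothetical helper: mean activity index by weekday and by month.
# Uses numeric date formats so the result does not depend on the system locale.
calendar_profile <- function(activity) {
  d <- as.Date(activity$date)
  # "%u" gives the weekday as a number, with Monday = 1
  activity$wday <- factor(format(d, "%u"), levels = as.character(1:7),
                          labels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))
  activity$month <- factor(as.integer(format(d, "%m")), levels = 1:12,
                           labels = month.abb)
  list(
    weekly  = aggregate(activity_index_total ~ wday, data = activity, FUN = mean),
    monthly = aggregate(activity_index_total ~ month, data = activity, FUN = mean)
  )
}

# Example call on the data loaded above:
# calendar_profile(parks_activity)
```

The same wday and month columns could later serve as covariates, for example on the transition probabilities of a hidden Markov model.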

Here, the main interest is identifying patterns in visitation. Three hidden states we may be interested in identifying could be: low, medium, and high visitation.

Things to think about: Cells may differ in what counts as a high level, so it may be important to consider things like random effects in the emission distributions. Especially as the number of cells included increases, it may also be important to look at unexplained spatial autocorrelation.
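To make the idea of hidden states concrete, here is a minimal base R sketch of the likelihood of a hidden Markov model computed with the scaled forward algorithm. It is only an illustration: hmm_loglik() is a hypothetical function, the Gaussian emission distribution is an assumption chosen for simplicity (the activity index is non-negative and contains zeros, so another distribution may suit it better), and the parameter values in the commented example are arbitrary placeholders, not estimates. In practice, a dedicated HMM package would handle the fitting.

```r
# Log-likelihood of an N-state HMM with Gaussian emissions (scaled forward algorithm).
# y:         numeric vector of observations for one cell
# mu, sigma: state-specific means and standard deviations (length N)
# Gamma:     N x N transition probability matrix (rows sum to 1)
# delta:     initial state distribution (length N)
hmm_loglik <- function(y, mu, sigma, Gamma, delta) {
  N <- length(mu)
  # Emission densities: one row per time step, one column per state
  dens <- sapply(seq_len(N), function(j) dnorm(y, mu[j], sigma[j]))
  dens <- matrix(dens, nrow = length(y), ncol = N)
  # Initialise with the first observation
  phi <- delta * dens[1, ]
  ll <- log(sum(phi))
  phi <- phi / sum(phi)
  # Propagate through the transition matrix, rescaling at each step
  for (t in seq_along(y)[-1]) {
    phi <- as.vector(phi %*% Gamma) * dens[t, ]
    ll <- ll + log(sum(phi))
    phi <- phi / sum(phi)
  }
  ll
}

# Example call for one cell with three states (low, medium, high).
# All parameter values are arbitrary placeholders:
# y <- parks_activity$activity_index_total[parks_activity$cellID == 1]
# Gamma <- matrix(c(0.8, 0.1, 0.1,
#                   0.1, 0.8, 0.1,
#                   0.1, 0.1, 0.8), nrow = 3, byrow = TRUE)
# hmm_loglik(y, mu = c(0.02, 0.1, 0.3), sigma = c(0.05, 0.05, 0.1),
#            Gamma = Gamma, delta = rep(1 / 3, 3))
```

Maximizing this function over the parameters (for instance with optim(), after transforming them to an unconstrained scale) would give maximum likelihood estimates for a single cell. Letting mu vary by cell is one way to start thinking about the random effects mentioned above.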

Acknowledgments

If the analysis you developed with your team, or a derivative of it, is used in your thesis or in a publication, please add the name of our research team and of your teammates in the acknowledgments. Please also let Marie Auger-Méthé and Vianey Leos Barajas know, as we will be excited to see that our datathon has provided you with concrete help! The name of our team is: CANSSI: Advancing Statistical Methods for the Analysis of Complex Biologging Data Collected from Humans and Animals.